Summary

In order to provide a framework for understanding the molecular
biology of Epstein-Barr virus (EBV), we are determining the DNA sequence of
the virus and studying the organization of genes on the viral genome. In this
paper we report the DNA sequence of the EcoRI C fragment of the B95-8
strain of EBV. The large (approximately 13.6 kb) deletion in this strain has
been located by comparison with the DNA sequence of EBV isolated from
Raji cells. The sequence has been analysed for possible protein coding regions
and transcriptional control sites. At least eight large open reading frames are
found, some of them associated with canonical promoter and polyadenylation
sequences. The sequences of some of the encoded proteins suggest that they are
membrane proteins. It is known that antibodies to major membrane
glycoproteins of EBV can neutralize infection in tissue culture. A possible
relationship between some of the encoded proteins and the major membrane
glycoproteins of the virus is discussed.†



Introduction 



Relatively little is known about the molecular biology of Epstein-Barr virus (EBV)
[...] infectious mononucleosis and
[...] nasopharyngeal carcinoma (reviewed by
[...]). An understanding of the structure of the viral
genome and of its expression will be extremely useful in developing
vaccines or synthetic immunogens against EBV infection. Since EBV transforms some
human cells in tissue culture, an analysis of the structure and function of its genes
[...] an important contribution to our
understanding of [...]. The virus also exhibits long-term
[...] and provides a system in which to study the importance of
this type of infection. The alternative growth patterns of the virus (latent and




[...] genome and the lack of a simple
permissive cell culture system [...]. [...]
tumour promoters to induce EBV synthesis in transformed cell lines (zur Hausen et al.,
1978, 1979) [...] possible.

Restriction enzyme maps for several [...] have
been established and [...] recombinant DNA techniques
for the EcoRI and [...] restriction enzyme
fragments (Hummel & Kieff, 1982a). Using [...] translation, the genes for
some EBV proteins have been assigned [...] (Hummel &
Kieff, 1982[...]). In addition [...]
polymerase have been located [...] in the small
unique region (U[...]) [...]. [...] hybridization studies have
revealed the existence of [...] EBV (Pritchett et al., 1975;
Sugden et al., [...]). [...] established by
infecting marmoset lymphocytes [...] as far as we know,
the B95-8 strain [...] carrying the
deletion. However, [...]
and gp220), which are antigenically related, are [...] uniquely abnormal ratio
in the B95-8 line (Thorley-Lawson & Geilinger, 1980). [...] DNA (see
restriction map in Fig. 1) and [...] comparison
with sequences determined in Raji EBV [...]
present a detailed analysis of the [...] potential transcription
sequences found in this region of EBV.



† See footnote to p. 21.
Materials and methods



(1) DNA recombinants 

Recombinant cosmids containing the EcoRI C fragment of B95-8 and Raji EBV
DNA were obtained from B. E. Griffin and J. Arrand, ICRF Laboratories, Lincoln's
Inn Fields, London. DNA purifications were carried out as previously described
(Birnboim & Doly, 1979). Recombinant DNA inserts were isolated by restriction
enzyme digestion followed by electrophoresis through LGT agarose (Wieslander,
1979). A random subclone library of the EcoRI C fragment was generated by
sonication (Deininger, 1983) followed by cloning into the bacteriophage M13mp7
(Messing et al., 1981), M13mp8 or mp9 (Messing & Vieira, 1982) vector systems.

(2) DNA sequence analysis 

Single-stranded templates prepared from the M13 subclone library (Sanger et
al., 1980) were sequenced by the dideoxynucleotide chain termination method (Sanger
et al., 1977; Sanger & Coulson, 1978) using a complementary synthetic oligonucleotide
primer (Duckworth et al., 1981). Sequence data from the M13 subclone library were
aligned and overlapped by computer (Staden, 1982). The complete sequence of both
strands of DNA was established. Completion of the sequence required some
non-random sequencing. Initially, M13 clones were not obtained surrounding an EcoK
site, which happens to occur in the EcoRI C sequence (Escherichia coli JM101 is r+).
Clones covering this region were generated by isolation of restriction fragments
covering the site and forced cloning into M13. A total of 536 sequencing gel readings
were performed to obtain the EcoRI C sequence. The total length of sequence read was
112,581 bases. On average each nucleotide of the sequence was determined 6.5 times.
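The redundancy quoted above follows from simple arithmetic, which the short Python sketch below reproduces. The number of gel readings and the total bases read are taken from the text; the fragment length is an assumption, taken here from the upper bound of the region heading "(c) The region 9470-17,172", since the length is not stated in this section.

```python
# Figures quoted in the text.
GEL_READINGS = 536
TOTAL_BASES_READ = 112_581

# Assumption: fragment length taken from the region heading
# "(c) The region 9470-17,172"; it is not stated in this section.
FRAGMENT_LENGTH = 17_172


def sequencing_stats(total_bases, readings, fragment_length):
    """Return (mean bases per gel reading, mean times each base was read)."""
    mean_read_length = total_bases / readings
    redundancy = total_bases / fragment_length
    return mean_read_length, redundancy


mean_read_length, redundancy = sequencing_stats(
    TOTAL_BASES_READ, GEL_READINGS, FRAGMENT_LENGTH
)
print(f"mean reading length: {mean_read_length:.0f} bases")
print(f"mean redundancy: {redundancy:.2f}x")
```

Under this assumed length the redundancy comes out close to the 6.5-fold figure given in the text, with each gel reading contributing a little over 200 bases on average.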

(3) Nomenclature 

Because the large open reading frames and potential transcription signals are found
scattered on both strands of the DNA we have used the nomenclature R (rightward)
strand and L (leftward) strand. Reading frames on the R strand are named RF1-5 and
those on the L strand are named LF1-6. In order to avoid confusion, in the future we
propose to prefix these names with the initial letter of the restriction enzyme and the
fragment number, e.g. EC- for EcoRI fragment C; BX- for BamHI fragment X. The
in vitro promoters are designated R or L (for rightward and leftward) followed by a
number. To avoid [...] of using two nucleotide numbering systems
[...] numbering for both strands
throughout. Although most groups use a map of the type shown in Figure 1, we note
that [...] an opposite map of the genome (Hayward et al.,
[...]).



Results and discussion 

(1) Sequence analysis



Because of the lack of genetic data for EBV, we have adopted the approach of first
establishing its DNA sequence, then using this in a predictive fashion to design further
experiments to understand the coding and expression of the EBV genome.

[...] sequence of the
EcoRI C fragment (Fig. 2). Marked on the sequence are major open reading frames,
[...] potential poly(A) addition sites (AATAAA)
[...] in vitro transcription (Farrell et al.,
1983). [...]
shown in a simplified form in Figure 3. Although smaller coding regions, whether
genes or individual exons, are possible we have excluded those open reading frames
[...] under 200 codons in length. We have also
examined the codon usage of the reading frames to see whether they have a distinctive
preference in order to try to predict coding and non-coding regions.

[...] DNA sequence without any other biological
[...] (see, for example, adenovirus, reviewed by Tooze, 1980).
Although there are specific sequences associated with transcription and RNA splicing
[...] (Corden et al., 1980) and the
[...] ([...], 1980) of eukaryotic promoters. In some cases no
[...] be found in eukaryotic promoters, e.g. SV40 and
[...] promoters (see review by
Shenk, [...]). [...] splice donor and splice acceptor junctions
of CAG/[...] and [...]G/G respectively (see collection of sequences by
Mount, [...]). The [...] redundancy in these sequences makes their prediction very
difficult. Nevertheless, when these promoter and splice sequences are near to their
[...] combination with open reading frames
and the polyadenylation sequence AATAAA, then their presence can be highly
suggestive of transcriptional units and protein coding regions.

Figure 2. DNA sequence of the EcoRI C fragment of B95-8 EBV (sequence listing not reproduced). [...] Farrell et al. are shown with the canonical promoter sequence TATAAAA indicated by [...] complementary bases. The B95-8 deletion point as compared with Raji DNA is shown.


Figure 3. Organization of the EcoRI C fragment of B95-8 EBV. Open reading frames (RF1 etc.) are shown
by horizontal thin arrows; AAUAAA sequences (vertical thin arrows) and the position of the deletion
relative to EBV from Raji cells are also marked. The three promoters L1, L2 and R are indicated. The scale
is in kb.



(2) Codon preferences 

EBV has a G + C content of 58% (Weinberg & Becker, 1969; Schulte-Holthausen &
zur Hausen, 1970) and the nucleotide composition of the EcoRI C fragment matches
this closely (A 20.6%, G 29.1%, C 29.4%, T 20.9%). The pattern of codon usage in
open reading frames detected in the DNA sequence can be used to predict whether
these are actually translated. For any particular amino acid, the first two positions in
its codons are nearly always fixed (except for the six-codon families: leucine UUR,
CUX; serine UCX, AGY; and arginine CGX, AGR), so with a fairly random
amino acid sequence the composition of the nucleotides in the first two positions will
also be fairly random. In a coding region with a strong G + C bias the excess G + C
nucleotides will tend to be concentrated into the third position of the codons. So the
nucleotide bias tends to be exaggerated in the codon usage and may produce a highly
characteristic codon usage pattern. If the region is not being expressed into protein
there is no such constraint pushing the excess G + C nucleotides into the third position
and the codon usage pattern will be quite different. Such a highly biased codon usage
pattern has been observed previously in bacteriophage with 31% T. In a random
protein sequence encoded by a nucleotide sequence with a 58% G + C content, the
G + C content of the first two codon positions should be 50% and the G + C content of
the third position would be predicted to be 74%. Table 1 lists the % G + C content in
the third position of the codons in all the reading frames shown in Figures 2 and 3.
The average G + C content in the third position of the codons in these reading frames
is 69%, approaching the third position G + C content calculated for a random protein
sequence and substantially higher than the 58% of the whole sequence. This highly
non-random distribution of the G + C content in the reading frames makes it extremely
likely that they or large elements of them correspond to protein coding regions. The
biased codon usage will be detected by the FRAMESCAN program (Staden &
McLachlan, 1982), which analyses the codon usage of a known gene or potential gene
and uses the characteristic codon pattern to predict other similar possible coding
regions. Thus we can quickly scan EBV sequences for similar putative coding regions
TABLE 1

Third position G + C in reading frames

Frame       Length    3rd G+C    % 3rd G+C

EC-RF1*       186       118        63.4
EC-RF2        571       445        77.9
EC-RF3        606       425        70.1
EC-RF3a       346       250        72.5
EC-RF4†       290       173        59.6
EC-LF1        858       651        75.9
EC-LF2       1016       832        81.9
EC-LF3        313       229        73.2
EC-LF4        249       144        57.8
EC-LF5        608       369        60.7
EC-LF6‡       116        71        61.2

Average excluding RF3a, RF4:       69.1

* [...] overlap EC-RF2 is included.

† See text. Possibly spliced, but whole frame has been included here.

‡ Only that part of the frame which is in the EcoRI C fragment is included.

and also use it to try to detect intervening sequences. This is particularly useful when
the intervening sequence is "in frame" with the rest of the coding sequences and
contains no termination codons. Below we describe the major features found in the
DNA sequence and discuss the possible organization and expression of coding regions.


(3) Organization of reading frames 
(a) Region 1-4500



[...] 5' end of the
[...]
purine at -3 (Kozak, 1981) downstream of this promoter is found at the start of LF6,
position 350, one base beyond the termination codon at the end of LF5. The
reading frame extends into the EcoRI H fragment and could code for a polypeptide of
[...] (personal communication). The EcoRI H fragment has
been [...] absence of a nearby
AATAAA [...]
[...] donor sequence at position 390 has a seven out of nine base
match with the consensus sequence of (C/A)AGGTPuAGT.

[...] LF5 is overlapped with
another methionine [...] continues on until
position 2920, overlapping at its 3' end with another open frame RF2 which begins
[...] termination [...]
[...] downstream. A possible promoter
sequence for RF2 is situated upstream at position 2616 with the sequence
TATTTAAAAA and also the sequence GCCAATT at 2546. Interestingly the 22
base-pairs centred on position 2620 display twofold symmetry with only two mismatched
base-pairs. This potential promoter sequence was not detected by the in vitro
transcription experiments of Farrell et al. (1983), although they did find some low level
transcripts emanating from this region. Expression of RF2 could also be effected using
the potential splice acceptor sequence two bases upstream of the first methionine
codon. No obvious promoter-like sequences can be seen near the beginning of the RF1
and LF5 frames though potential splice acceptor sequences are found at position 2105
for RF1 and at 2244 for LF5. Both of these sequences occur (in the complementary
sense) within the coding region of the other reading frame. Thus the possibilities exist
for at least the independent expression of RF2 and possibly of a spliced RF1 + RF2
frame. In the latter case a good potential splice donor sequence CAG/GTAAGC,
which is an eight out of nine match with the donor consensus sequence, exists at
position 2256 which is 28 codons downstream of the start of RF1.
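The "eight out of nine" score for the candidate donor can be reproduced with a position-by-position comparison. The Python sketch below is illustrative only; it assumes the nine-base donor consensus (C/A)AG/GT(A/G)AGT compiled by Mount, and the function name is hypothetical.

```python
# Splice donor consensus (C/A)AG/GT(A/G)AGT, one entry per position.
# Each entry is the set of bases counted as a match at that position.
DONOR_CONSENSUS = ["CA", "A", "G", "G", "T", "AG", "A", "G", "T"]


def consensus_matches(site, consensus=DONOR_CONSENSUS):
    """Count positions of `site` that agree with the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base in allowed for base, allowed in zip(site.upper(), consensus))


site = "CAGGTAAGC"  # candidate donor at position 2256, slash removed
print(f"{consensus_matches(site)} of {len(DONOR_CONSENSUS)} positions match")
```

Only the final base (C in place of the consensus T) fails to match, which gives the eight out of nine quoted in the text.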



(b) The region 4500-9000

(i) RF3 and 3a

The large open reading frame RF3 extends from the first available methionine
codon at 5241 to the termination codon at 7056 which lies five codons past the
AATAAA sequence at its 3' end. No in vitro promoter was detected upstream of the
methionine codon although a potential promoter sequence TATTTAT occurs at 5035.
Also possible splice acceptor sequences are found at positions 5059, 5131 and at 5218
before the methionine codon at 5241. There is, however, a promoter in the middle of
the RF3 frame with a transcription start at approximately 5962-5972 (Farrell et al.,
1983). The first methionine codon is found in frame at position 6021. Thus the
potential exists to express the whole frame RF3 or the 3' half RF3a.
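The coordinates quoted for RF3 and RF3a can be related to the frame lengths in Table 1 with a small calculation. The sketch below assumes, as a working convention not stated in the text, that each quoted position is the first base of the initiation or termination codon and that the tabulated length counts the termination codon. The function name is hypothetical.

```python
def frame_length_in_codons(start, stop):
    """Codons from the initiation codon to the termination codon, inclusive.

    `start` is taken as the first base of the initiating ATG and `stop`
    as the first base of the termination codon (an assumed convention).
    A frame on the L strand simply has start > stop.
    """
    span = abs(stop - start)
    if span % 3:
        raise ValueError("start and stop are not in the same reading frame")
    return span // 3 + 1


print(frame_length_in_codons(5241, 7056))  # RF3
print(frame_length_in_codons(6021, 7056))  # RF3a
```

Under this convention the two calls give 606 and 346 codons, the lengths listed for EC-RF3 and EC-RF3a in Table 1, and the shared frame check confirms that the methionine codon at 6021 is in phase with the one at 5241.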

(ii) LF4 

Twenty bases beyond the AATAAA sequence at the 3' end of RF3 is another
AATAAA sequence on the L strand at position 7072. Associated with this is an open
reading frame LF4 that could code for a protein of 248 amino acids starting at the
methionine codon at 7839 and extending to the termination codon at 7095, 28 bases
before the AATAAA sequence. The methionine codon is preceded by the sequence:

L strand 5'-CACACCCCACCACATATTTAGG-3'.

The pyrimidine-rich sequence 5' to the AG could serve as a splice acceptor sequence
or the underlined sequence could correspond to the "TATA" promoter element. The
preceding overlined sequence corresponds to the sequence found in the L2 and R
promoters and in a number of other eukaryotic promoters (Farrell et al., 1983). The
likely transcription start for this putative promoter would be centred on the 12 base
twofold symmetric sequence 5' AGTTTTAAAACT 3' at position 7857. The amino
acid sequence predicted from the LF4 reading frame contains many potential
glycosylation sites (Neuberger et al., 1972). In fact 12 out of the 14 asparagines in the
sequence are followed two residues away by serine or threonine.
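The criterion used above (an asparagine followed two residues later by serine or threonine) is easy to apply to any predicted protein sequence. The Python sketch below is illustrative only: the peptide is made up and is not the LF4 sequence, and the option to exclude proline at the middle position is a common refinement that is not part of the criterion used in the text.

```python
def glycosylation_sites(protein, exclude_proline=False):
    """Return 0-based positions of Asn residues followed two residues
    later by Ser or Thr (the Asn-X-Ser/Thr pattern).

    exclude_proline=True additionally skips sites where X is proline,
    a refinement that is not part of the criterion used in the text.
    """
    protein = protein.upper()
    sites = []
    for i in range(len(protein) - 2):
        if protein[i] != "N" or protein[i + 2] not in "ST":
            continue
        if exclude_proline and protein[i + 1] == "P":
            continue
        sites.append(i)
    return sites


# Made-up peptide for illustration only; it is not the LF4 sequence.
peptide = "MNGSANPTLNKTQ"
print(glycosylation_sites(peptide))
print(glycosylation_sites(peptide, exclude_proline=True))
```

Dividing the number of sites found by the total number of asparagines gives the kind of ratio quoted for LF4.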



(iii) The B95-8 deletion region 

The B95-8 strain of EBV is missing approximately 13.6 kb of DNA in the BamHI I
fragment when compared with other strains (Raab-Traub et al., 1980). Part of the
missing DNA has been shown to be homologous to a region of DNA in the BamHI H
fragment approximately 90 kb away. We have mapped the deletion point in
the B95-8 sequence by also sequencing part of the Raji RI C fragment, which does not
carry the deletion. So far we only have sequence data for the 5' side of the deleted
sequence so we do not yet know whether there are any repeated sequences involved in
the deletion event. The data, however, allow us to define accurately the deletion point
[...] sequences. The deletion is between residues 9326-7 of the
B95-8 EcoRI C fragment. The Raji sequences will be published separately.

The region containing the deletion point between the LF3 and LF4 frames does not
contain any open frames longer than 127 codons and we have not yet found any
[...] expression. An AATAAA sequence occurs on the R strand
at 9081 and the largest open reading frame close to this is 64 codons. A possible
promoter sequence CATAAAA occurs on the L strand at this point (9094). Other
possible promoter sequences occur at 9544 (CATAAAA) and at 9574 (TATAATGA).
A repetitive sequence occurs between 8550 and 8940 and the homologous sequences
are shown in Table 2. A sequence of 14 consecutive Gs occurs in this region [...].



(c) The region 9470-17,172 
(i) LF1



[...] EcoRI C fragment sequence is taken
up by three large open reading frames LF1-3. These are preceded by the L1 promoter
where the transcription start has been mapped to approximately 16,650. The first
methionine codon preceded by a purine at -3 occurs at 16,636 at the beginning of the
LF1 frame. This frame ends 41 bases before the AATAAA sequence at 14,021.
However, using the program FRAMESCAN (Staden & McLachlan, 1982) the codon
usage pattern changes considerably between 15,550 and 15,490, and to a lesser extent
to 15,374, even when using the LF1 codon pattern itself as the input reference.
[...] promoter sequences
[...] TATAAAATAT at (15,518) as noted by Farrell et al. (1983)
but not detected as an in vitro promoter by them. This may indicate that this region is
spliced out of the LF1 frame and lends some support to the idea that this region
[...] on the L
[...]
[...] encompass the potential promoter region. If this region
does function as a promoter it would presumably splice to the reading frames
downstream at the end of the EcoRI C fragment and in the D het fragment [...]
avoid the in phase initiation and termination codons in the 150 bases downstream of
the possible transcription initiation point.



(ii) LF2 



LF2 extends from the methionine codon (14,060) two bases after the LF1
termination codon, past the LF1 AATAAA sequence to the termination codon at
11,015. No candidate for a promoter for LF2 can be predicted so that expression of
this frame may be effected by splicing from a leftward promoter such as L1. A possible
splice acceptor sequence occurs at 14,137. No AATAAA sequence is found at the 3'
end of LF2 so it would presumably be spliced at this end as well. A 22 base-pair
sequence with perfect twofold symmetry is found at positions 10,951-10,972 at the end
of the LF2 frame.
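Twofold symmetry of the kind noted here can be tested by comparing a sequence with its own reverse complement. The Python sketch below is illustrative only. Because the 22 base-pair sequence is not quoted in the text, the example uses the 12 base symmetric sequence quoted in the LF4 section (5' AGTTTTAAAACT 3'); the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def symmetry_mismatches(seq):
    """Number of mismatched base-pairs when `seq` is compared with its
    own reverse complement; 0 means perfect twofold symmetry."""
    seq = seq.upper()
    if len(seq) % 2:
        raise ValueError("expected an even-length sequence")
    half = len(seq) // 2
    rc = reverse_complement(seq)
    # Each mismatch shows up twice across the full length, so only the
    # first half is counted.
    return sum(a != b for a, b in zip(seq[:half], rc[:half]))


print(symmetry_mismatches("AGTTTTAAAACT"))
```

A result of zero corresponds to perfect symmetry, while a count of two would correspond to the "only two mismatched base-pairs" description used for the 22 base-pairs centred on position 2620.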



(iii) LF3 

This frame lies approximately 600 bases beyond the 3' end of LF2 starting at the
methionine at 10,413 and extending to 9475, about 150 bases before the deletion point.
No promoter sequences are found but a possible splice acceptor sequence is situated at
[...]. There are two possible initiation codons at the start of the frame at 10,413 and
10,401. No AATAAA sequence is present near the 3' end although this is close to the
deletion point. A possible splice donor sequence occurs approximately 50 bases before
the end of the frame at 9526. A membrane role is predicted from the amino acid
sequence which contains a regular array of hydrophobic regions separated by regions
of charged amino acids (see below).

(iv) RF4 

A [...] reading frame overlapping LF2 is found on the R strand. No promoter or
AATAAA sequences are found near this frame. Several possible donor and acceptor
splice sequences are present (13,038, 13,163, 13,190, 13,320 and 13,403) suggesting that
perhaps only parts of the overlapping reading frame RF4 might be used.

Although the sequence of the EcoRI D het fragment is known (unpublished results)
we have not yet overlapped the C and D het sequences to exclude the possibility of a
small undetected EcoRI fragment between them. Therefore we have not correlated the
D het reading frames with those occurring between the L1 promoter and the 3' end of
the EcoRI C fragment.



(4) AATAAA sequences 

[...] frames have AATAAA sequences close to their 3' ends,
suggesting that these sequences play a role in their mRNA maturation and
polyadenylation. [...] AATAAA sequences are
[...] non-coding sequences of mRNAs. Benoist et al. (1980)
noted that there was a homologous sequence in the 30 bases downstream of the
AATAAA sequence in some mRNAs with the consensus sequence TTTTCACTGC.
Except for a seven out of ten match with a sequence near the AATAAA at 10,573,
neither those AATAAA sequences associated with the ends of reading frames nor the
isolated AATAAA sequences displayed any homology with this consensus sequence. In
some cases possible splice acceptor sequences are seen associated with short reading
frames near the other AATAAA sequences. Three such sequences occur upstream from
the AATAAA (position 922) at 792, 840 and 877 and in another case at 9049 preceding
the AATAAA sequence at 9081. Exceptions to the AATAAA sequences are known:
several cases of ATTAAA (listed by Ahmed et al., 1982) and one of ATAA (Setzer et
al., 1980). No ATTAAA sequences are found in the EcoRI C fragment and we have
not searched for the sequence ATAA on the grounds that it is functionally very rare
and not unique enough, i.e. not long enough to consider all possible occurrences in this
sequence.
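A search of the kind described, scoring the 30 bases downstream of each AATAAA against the consensus TTTTCACTGC, can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name is hypothetical, only the given strand is scanned, and the test sequence is made up and not taken from the EcoRI C fragment.

```python
SIGNAL = "AATAAA"
DOWNSTREAM_CONSENSUS = "TTTTCACTGC"  # consensus quoted in the text


def best_downstream_match(seq, window=30, consensus=DOWNSTREAM_CONSENSUS):
    """For each AATAAA in `seq`, return (signal position, best score),
    where the score is the highest number of identities to `consensus`
    among all alignments lying wholly within `window` bases downstream.
    Positions are 0-based. Only the given strand is scanned.
    """
    seq = seq.upper()
    results = []
    pos = seq.find(SIGNAL)
    while pos != -1:
        start = pos + len(SIGNAL)
        region = seq[start:start + window]
        best = 0
        for i in range(len(region) - len(consensus) + 1):
            candidate = region[i:i + len(consensus)]
            best = max(best, sum(a == b for a, b in zip(candidate, consensus)))
        results.append((pos, best))
        pos = seq.find(SIGNAL, pos + 1)
    return results


# Made-up sequence for illustration; not taken from the EcoRI C fragment.
toy = "GGC" + "AATAAA" + "GCGCG" + "TTTTCACAGG" + "CCGGCCGGCCGGCCG"
print(best_downstream_match(toy))
```

To cover signals on the L strand the same function would be applied to the reverse complement of the sequence, with the reported positions converted back to the single numbering system used in the text.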

(5) Coding capacity 

The EcoRl C fragment represents approximately (0% of the EBV genome and the 
sequence analysis of this fragment provides the first opportunity to assess the coding 
potential and arrangement of the genome. From an analysis of the sequence we would 
predict at least six to eight proteins coded in this region. However, with multiple choice 
splicing as found in adenovirus this could be a conservative estimate. The open reading 
frames cover most of the DNA sequence and their arrangement with respect to each 
other coupled with their similarity of codon usage (enhanced third position GC) lead 
us to expect that most if not all of the indicated open frames are expressed into 
protein. The conclusion is further strengthened by the finding of AATAAA sequences 
near the 3' ends of half of the large open reading frames which could participate in 
their mRNA maturation and polyadcnylation. The lack of these sequences at the ends 
of the other open frames coupled with the lack of obvious promoter sequences near the 
5' end of most of the frames would suggest that RNA splicing is involved in their 
putative expression. In most cases there are good candidates for splice acceptor and 
donor sites near their 5' and 3' ends, respectively. However, given the possible 
redundancy in these junction sequences, these are difficult to predict with any certainty. 

There are several examples of coding economy in the sequence, which is perhaps 
surprising in such a large viral genome. The termination codon for LF1 and the 
possible initiation codon for LF2 are only two bases apart. Similarly those for LF5 
and LF6 are only one base apart. The possible initiation codons for RF1 and LF5 are 
overlapping on the opposite strands. The AATAAA sequences associated with the 
frames RF3, 3a and LF4 which face each other on opposite strands are only 20 bases 
apart. Thus the mRNA ends would be expected to overlap on the opposite strands by 
about 20 bases. Also the closeness of the AATAAA sequences to the termination 
codons is characteristic of the compactness of viral genomes. 

(6) Transcription 

Three promoters were mapped by Farrell et al. (1983) in this fragment using the 
HeLa cell extract described by Manley et al. (1980). This in vitro system has been 
shown to be an accurate reflection of the in vivo situation for some but not all 
promoters. Thus, for example, late viral promoters depending on a viral function 
might not be expected to be active in this system. This situation has been argued for 
a gene of another herpes virus, HSV1 (Frink et al., 1981). Thus it is possible that this region 
contains other promoters not detected in vitro. A number of possibilities have been 
mentioned previously in the text. If a viral product is indeed required to turn on 
transcription from these promoters then their detection may depend on the 
development of an EBV in vitro system using a productively EBV infected cell extract. 
Hummel & Kieff (1982) have mapped a large number of early and late 
transcripts [...] to restriction enzyme 
fragments. In the EcoRI C fragment there are five BamHI sites so we can attempt to 
correlate some of the transcripts with the open reading frames. In BamHI T and X (W 
in the notation of Hummel & Kieff) a 5.0 kb early and a 2.7 kb late transcript have been 
mapped. These do not extend into the BamHI V fragment (see Fig. 2) (V, in their 
nomenclature) so that it is likely that these transcripts come from the LF5 and LF6 
reading frames. RF1 and RF2 are almost totally situated in BamHI V. Four 
transcripts are mapped to this fragment, of sizes 6.4 kb, 5.5 kb, 3.2 kb and 2.5 kb. As 
this fragment is only approximately 3 kb in length the relationship of the four 
transcripts to [...]. Also these transcripts do not seem 
to hybridize to the adjacent BamHI fragments, so they could represent multiple 
[...] frames. [...] 
represents the RF3a reading frame transcribed from the R promoter. BamHI I 
contains two transcripts, the 1.4 kb late transcript assigned to RF3a and a 1.1 kb late 
transcript. LF4 and LF3 and part of LF2 are in this fragment. LF3 and LF4 are 
suitable sizes with LF4 being the best candidate with a possible promoter or splice 
acceptor sequence at the 5' end and an AATAAA sequence at the 3' end. No other 
transcripts were detected that could correspond to the large LF2 frame. A 3.0 kb RNA 
has been mapped to the BamHI A fragment of the EcoRI C fragment. This could 
correspond to the LF1 (857 codons) or the LF2 (1015 codons) frames. 
Therefore, although some correspondence can be found between the two sets of 
[...] a combination of probing the 
[...] bacteriophage M13 clone bank of over 500 
[...] sequencing a cDNA library of the viral [...] 
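
The comparison of transcript sizes with reading frames is simple arithmetic, sketched below for the 3.0 kb RNA and the LF1 and LF2 frames. The function names are hypothetical, and the sketch assumes that each codon counted contributes three nucleotides and that untranslated regions and the poly(A) tail account for the remainder of the transcript.

```python
def coding_length_nt(codons: int) -> int:
    """Nucleotides needed to encode the given number of codons."""
    return 3 * codons


def untranslated_remainder_nt(transcript_kb: float, codons: int) -> int:
    """Transcript length left over after the coding region (may be negative)."""
    return round(transcript_kb * 1000) - coding_length_nt(codons)


# Codon counts from the text; the 3.0 kb RNA mapped to BamHI A.
for name, codons in (("LF1", 857), ("LF2", 1015)):
    print(name, coding_length_nt(codons), untranslated_remainder_nt(3.0, codons))
# LF1 2571 429
# LF2 3045 -45
```

The small negative remainder for LF2 only indicates that 1015 codons slightly exceed a nominal 3.0 kb. Transcript sizes estimated from gels are approximate, so this figure does not by itself exclude either assignment.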

(7) Early membrane antigens and the B95-8 deletion 

[...] strains, the 13.6 kb deletion [...] gp350 and gp220 (which are [...]). 
[...] interest in these membrane antigens [...] neutralize EBV infections in 
tissue culture (Hoffman et al., 1980; Thorley-Lawson & Poodry, 1982). These 
glycoproteins or parts of them might therefore be used for immunizing people against 
EBV infection. Such a procedure might be applied to EBV seronegative adolescents to 
reduce the risk of contraction of infectious mononucleosis. Another use of such an 
immunization might be to assess the causative role of EBV infection in African 
Burkitt's lymphoma by immunizing very young children from the appropriate 
geographic region against EBV infection and comparing the subsequent incidence of 
the disease among them with a non-immunized group (Epstein, 1976). The membrane 
antigens gp350 and gp220, which are designated VE1 and VE2 by Wells et al. (1982), 
are also thought to play an important role in the extreme selectivity of EBV binding to 
cell surfaces. 

Whether there is a direct genotype/phenotype relationship between the DNA 
deletion and altered expression of the membrane antigens is uncertain, but the two 
leftward reading frames (LF3 and LF4) around the deletion point have features 
consistent with a membrane glycoprotein. Computerized analysis of the hydropathy 
(Kyte & Doolittle, 1982) shows that LF3 has alternating stretches of hydrophobic 
amino acids with charged residues in between and is probably a membrane protein. 
The most likely arrangement for this polypeptide in a membrane is shown in Figure 4. 
LF4 has a large number of potential glycosylation sites (sequence Asn-X-Thr/Ser). It is 
possible that some or all of the leftward reading frames could be spliced together to 
produce the antigenically related glycoproteins of molecular weights 350,000 and 
220,000 (gp350 and gp220). The presence of the deletion between LF3 and LF4 could 
in some way affect the levels of expression of the reading frames in this region. 
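
Both analyses mentioned above are easy to sketch. The following is a minimal illustration, not the program used here: the hydropathy values are those published by Kyte & Doolittle (1982), while the function names, the window length and the test peptides are assumptions made for the example (the window length used in this work is not stated). Glycosylation sites are taken exactly as defined in the text, Asn-X-Thr/Ser with any residue at X.

```python
import re

# Kyte & Doolittle (1982) hydropathy index, one-letter amino acid codes.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> list:
    """Mean hydropathy of every window of the given length along the chain."""
    protein = protein.upper()
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the protein length")
    scores = [KD[residue] for residue in protein]
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]   # slide the window by one
        profile.append(total / window)
    return profile


def glycosylation_sites(protein: str) -> list:
    """1-based positions of Asn in every Asn-X-Thr/Ser tripeptide."""
    # The lookahead makes overlapping sites (e.g. N-N-S-S) all count.
    return [m.start() + 1
            for m in re.finditer(r"(?=N[A-Z][ST])", protein.upper())]


# Hypothetical test peptides, not taken from LF3 or LF4.
print(hydropathy_profile("AAAIR", window=2))
print(glycosylation_sites("MNATNNSSA"))
```

Stretches where the windowed mean stays strongly positive are candidates for membrane-spanning segments, and runs of charged residues between them appear as troughs in the profile.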

Figure 4. Putative folding of the LF3 polypeptide through a lipid bilayer. 

The fact that the intact glycoproteins contain carbohydrate means that we cannot 
directly correlate the sizes of the polypeptides predicted from open reading frames with 
the sizes of the proteins. A similar problem arises in trying to correlate proteins 
produced by in vitro translation (where little or no glycosylation occurs) with the 
complete proteins found in vivo. Perhaps by cloning sections of the viral DNA into 
eukaryotic expression vectors, transfecting those into suitable tissue-culture cells and 
looking for production of the complete viral protein it will be possible to assign 
proteins unequivocally to the DNA sequence. 
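
The size of the unglycosylated polypeptide predicted from a reading frame can be estimated roughly as follows. The average residue mass of 110 daltons is a common rule of thumb and is an assumption of this sketch, as is the function name; the calculation treats every codon counted as one amino acid residue.

```python
AVERAGE_RESIDUE_MASS_DA = 110  # assumed rule-of-thumb value, not from this paper


def approx_polypeptide_mass_da(codons: int) -> int:
    """Rough mass in daltons of the unmodified polypeptide."""
    return codons * AVERAGE_RESIDUE_MASS_DA


# Codon counts quoted in the text for LF1 and LF2.
for name, codons in (("LF1", 857), ("LF2", 1015)):
    print(name, approx_polypeptide_mass_da(codons))
# LF1 94270
# LF2 111650
```

Estimates of this kind describe the polypeptide backbone only, which is why they cannot be compared directly with the apparent molecular weights of the glycosylated proteins.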